---
title: OM1 Video Processor
description: "OM1 Video Processor"
---

The OM1 Video Processor is a Docker-based solution that enables real-time video streaming, face recognition, and audio capture for OM1 robots. It's designed to work seamlessly with your robot's hardware while providing intelligent audio and video processing capabilities.

## Key Features

- **Real-time Face Recognition**: Identify and label faces in the video stream with bounding boxes
- **Audio-Visual Streaming**: Simultaneous video and audio capture and streaming
- **Hardware Acceleration**: Optimized for NVIDIA Jetson platforms with CUDA support
- **Easy Integration**: Simple Docker-based deployment
- **Performance Monitoring**: Built-in FPS counter and system metrics
- **Configurable devices**: Supports multiple camera and microphone configurations
- **Direct RTSP streaming**: Streams directly to OpenMind's API without intermediate relay

## What is RTSP?
- RTSP (Real Time Streaming Protocol) is a network control protocol designed to manage multimedia streaming sessions.
- It functions as a "remote control" for media servers, establishing and controlling one or more time-synchronized streams of continuous media such as audio and video.

### Key characteristics:

- Control Protocol: RTSP manages streaming sessions but does not typically transport the media data itself
- Session Management: Establishes, maintains, and terminates streaming sessions
- Time Synchronization: Coordinates multiple media streams (audio/video) to play in sync
- Network Remote Control: Provides VCR-like commands (play, pause, stop, seek) for media playback over a network
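Under the hood, RTSP is a text protocol much like HTTP: a method, a URL, a version, and CRLF-terminated headers. As an illustration (not part of this repository), here is a minimal sketch of building the `OPTIONS` request that typically opens a session; the URL uses the local MediaMTX port mentioned later in this document:

```python
def build_rtsp_request(method: str, url: str, cseq: int) -> str:
    """Build a minimal RTSP/1.0 request: request line, headers, blank line.

    A typical session issues OPTIONS, DESCRIBE, SETUP, then PLAY in order,
    with CSeq incremented on each request.
    """
    return (
        f"{method} {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        f"User-Agent: om1-docs-example\r\n"
        "\r\n"
    )

request = build_rtsp_request("OPTIONS", "rtsp://localhost:8554/stream", 1)
```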

### Architecture Diagram

![Architecture Diagram](../assets/om1_video_processor_architecture.png)

## OpenMind Privacy System

The privacy system runs entirely on the **robot's edge device** and automatically blurs faces to prevent any personal information from leaking. Raw frames never leave the device; only the blurred output is saved or streamed. It works offline and keeps latency low for real-time use.

### How it works

- **Find faces (SCRFD)**: Each frame is scanned with the face detector, which is robust to different angles and lighting and is optimized with TensorRT for fast inference.
- **Blur them**: The region around the detected bounding box is expanded, a smooth mask is created, and a strong Gaussian blur is applied so identity cannot leak or be recovered.
- **Err on the side of privacy**: When in doubt (low confidence, motion blur, occlusion), the face is blurred anyway to protect everyone's identity.
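The blur step can be sketched in plain NumPy. This is an illustrative approximation, not the actual OM1 implementation (which uses SCRFD detections and GPU-accelerated processing): it expands the box, builds a feathered mask, and approximates a Gaussian with repeated box blurs:

```python
import numpy as np

def blur_region(frame, box, margin=0.25, passes=3, k=9):
    """Return a copy of `frame` with an expanded face region blurred.

    frame:  HxW (grayscale) or HxWx3 uint8 array
    box:    (x1, y1, x2, y2) face bounding box
    margin: fraction by which to expand the box on each side
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - mx), max(0, y1 - my)
    x2, y2 = min(w, x2 + mx), min(h, y2 + my)

    patch = frame[y1:y2, x1:x2].astype(np.float32)
    kernel = np.ones(k) / k
    for _ in range(passes):            # repeated box blurs approximate a Gaussian
        for axis in (0, 1):
            patch = np.apply_along_axis(
                lambda m: np.convolve(m, kernel, mode="same"), axis, patch)

    # feathered mask: 1 in the centre, fading to 0 at the patch border
    yy = np.minimum(np.arange(y2 - y1), np.arange(y2 - y1)[::-1])
    xx = np.minimum(np.arange(x2 - x1), np.arange(x2 - x1)[::-1])
    mask = np.clip(np.minimum.outer(yy, xx) / 10.0, 0, 1)
    if frame.ndim == 3:
        mask = mask[..., None]

    out = frame.copy()
    out[y1:y2, x1:x2] = (
        mask * patch + (1 - mask) * frame[y1:y2, x1:x2]
    ).astype(frame.dtype)
    return out
```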

## System Requirements

- Docker and Docker Compose installed
- NVIDIA Jetson device with JetPack 6.1 (or compatible NVIDIA GPU system)
- Access to a video capture device via USB camera or built-in webcam (default: /dev/video0)
- Microphone (for audio streaming - default: default_mic_aec)
- OpenMind API credentials
- Linux system with V4L2 and ALSA support

## Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/OpenMind/OM1-video-processor.git
   cd OM1-video-processor
   ```
2. Set environment variables:

    Get OM_API_KEY and OM_API_KEY_ID from [OpenMind portal](https://portal.openmind.org/).
    Once you generate a new API key, copy the key and paste it in the OM_API_KEY environment variable.

    To get your API key ID, copy the 16-digit ID from your API key as highlighted in the image below:
    ![OM_API_KEY_ID](../assets/om_api_key_id.png)
    ```bash
    export OM_API_KEY_ID="your_api_key_id"
    export OM_API_KEY="your_api_key"
    ```

3. Configure your devices (Optional):
   ```bash
   export CAMERA_INDEX="/dev/video6"    # Camera device path (default: /dev/video0)
   export MICROPHONE_INDEX="default_mic_aec"     # Default microphone device
   ```

4. Ensure devices are accessible
   ```bash
   # Check available video devices
   ls /dev/video*

   # List video devices with v4l2
   v4l2-ctl --list-devices

   # Check available audio devices
   pactl list sources short
   pactl list sinks short
   ```
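Before starting the service, it can help to verify that the credentials and device settings are actually in place. A small illustrative check (not part of the repository) that mirrors the variables and defaults documented above:

```python
import os

REQUIRED = ("OM_API_KEY_ID", "OM_API_KEY")
OPTIONAL_DEFAULTS = {
    "CAMERA_INDEX": "/dev/video0",
    "MICROPHONE_INDEX": "default_mic_aec",
}

def load_config(env=os.environ):
    """Fail fast on missing credentials; fall back to documented defaults."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    cfg = {k: env[k] for k in REQUIRED}
    cfg.update({k: env.get(k, default) for k, default in OPTIONAL_DEFAULTS.items()})
    return cfg
```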

## Quick Start

1. Start the streaming service:
   ```bash
   docker-compose up -d
   ```

2. The system will automatically:
   - Initialize the camera and microphone
   - Start face recognition processing
   - Stream to the configured RTSP endpoint


## Configuration

The system is configured through the following components:

### Docker Compose Configuration

The docker-compose.yml file configures:

- NVIDIA runtime: GPU acceleration for face recognition processing
- Network mode: Host networking for direct device access
- Privileged mode: Required for camera and audio device access
- Device mapping: Camera (default /dev/video0) and audio (/dev/snd) devices
- Environment variables: OpenMind API credentials, device indices, and PulseAudio configuration
- Shared memory: 4GB allocated for efficient video processing
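Put together, those settings map onto a compose file shaped roughly like this. This is a sketch, not the repository's actual docker-compose.yml; the service and image names are placeholders:

```yaml
services:
  video-processor:
    image: om1-video-processor        # placeholder image name
    runtime: nvidia                   # GPU acceleration for face recognition
    network_mode: host                # direct network access for streaming
    privileged: true                  # required for camera/audio device access
    shm_size: "4gb"                   # shared memory for video processing
    devices:
      - /dev/video0:/dev/video0       # camera
      - /dev/snd:/dev/snd             # audio
    environment:
      - OM_API_KEY_ID=${OM_API_KEY_ID}
      - OM_API_KEY=${OM_API_KEY}
      - CAMERA_INDEX=${CAMERA_INDEX:-/dev/video0}
      - MICROPHONE_INDEX=${MICROPHONE_INDEX:-default_mic_aec}
```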

### Processing Pipeline

The streaming pipeline consists of two processes managed by Supervisor:

- MediaMTX: RTSP server for stream routing and management
- OM Face Recognition Stream: Main processing service that:
  - Captures video from the specified camera device
  - Performs real-time face recognition with GPU acceleration
  - Overlays bounding boxes, names, and FPS information
  - Captures audio from the specified microphone
  - Streams directly to OpenMind's RTSP ingestion endpoint
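The per-frame flow of the processing service can be sketched as a simple loop. The `detect`, `redact`, and `publish` callables below are placeholders standing in for the real SCRFD detector, privacy blur, and RTSP publisher:

```python
import time

class FpsCounter:
    """Smoothed frames-per-second estimate, as the on-frame overlay might use."""
    def __init__(self, alpha=0.1):
        self.alpha, self.fps, self._last = alpha, 0.0, None

    def tick(self, now=None):
        now = time.monotonic() if now is None else now
        if self._last is not None:
            inst = 1.0 / max(now - self._last, 1e-9)
            self.fps = inst if not self.fps else (1 - self.alpha) * self.fps + self.alpha * inst
        self._last = now
        return self.fps

def run_pipeline(frames, detect, redact, publish):
    """Per-frame loop: detect faces, redact them, then hand off to the publisher."""
    counter = FpsCounter()
    for frame in frames:
        for box in detect(frame):       # face detections for this frame
            frame = redact(frame, box)  # blur before anything leaves the device
        publish(frame, counter.tick())  # stream frame with FPS overlay info
```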

### Environment Variables

Configure the following environment variables:

- OM_API_KEY_ID: Your OpenMind API key ID (required)
- OM_API_KEY: Your OpenMind API key (required)
- CAMERA_INDEX: Camera device path (default: /dev/video0)
- MICROPHONE_INDEX: Microphone device identifier (default: default_mic_aec)

## How It Works

1. **Video Capture**: Captures video from the specified camera device
2. **Face Processing**: Uses AI to detect and recognize faces in real-time
3. **Audio Capture**: Simultaneously records audio from the microphone
4. **Streaming**: Combines video and audio into an RTSP stream
5. **Monitoring**: Provides real-time performance metrics

### Ports
The following ports are used internally:

- 8554: RTSP (MediaMTX local server)
- 1935: RTMP (MediaMTX local server)
- 8889: HLS (MediaMTX local server)
- 8189: WebRTC (MediaMTX local server)
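A quick way to confirm MediaMTX is up is to probe these ports. An illustrative helper (any TCP check would do):

```python
import socket

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the container running, the local RTSP port should answer:
# port_open("127.0.0.1", 8554)
```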

## Development

### Build the image:
```bash
docker-compose build
```

### Customize the processing settings:
To modify the om_face_recog_stream parameters, edit the command in video_processor/supervisord.conf.
```bash
vim video_processor/supervisord.conf
```

Configurable parameters:

- `--device`: Camera device path
- `--rtsp-mic-device`: Microphone device identifier
- `--draw-boxes`: Enable/disable bounding box overlays
- `--draw-names`: Enable/disable name overlays
- `--show-fps`: Enable/disable FPS display
- `--no-window`: Run in headless mode
- `--remote-rtsp`: OpenMind RTSP ingestion endpoint
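As a rough sketch of what such a Supervisor entry can look like (the actual file in the repository is authoritative; the paths, flag ordering, and endpoint below are placeholders):

```ini
[program:om_face_recog_stream]
; Illustrative only -- see video_processor/supervisord.conf for the real entry.
command=om_face_recog_stream
    --device %(ENV_CAMERA_INDEX)s
    --rtsp-mic-device %(ENV_MICROPHONE_INDEX)s
    --draw-boxes --draw-names --show-fps --no-window
    --remote-rtsp rtsp://<openmind-ingest-endpoint>/stream
autorestart=true
```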

```bash
# Install dependencies locally
uv sync --all-extras

# Run the face recognition stream locally
uv run om_face_recog_stream --help
```

## Troubleshooting

### Common Issues

- **No video feed**:
  - Verify camera device permissions
  - Check if the camera is being used by another application

- **Audio not working**:
  - Verify the correct audio device is specified
  - Check PulseAudio configuration

- **Performance issues**:
  - Ensure hardware acceleration is properly configured
  - Reduce resolution or FPS if needed
